Introduction to Exploratory Data Analysis (EDA)

What is EDA?

Exploratory Data Analysis (EDA) is the first and crucial step when working with any dataset. It allows us to familiarize ourselves with the data by summarizing its main characteristics, using graphs and visualizations, and forming informed hypotheses. EDA helps in driving the selection of features for modeling by understanding the data and uncovering patterns, trends, missing values, anomalies, or relationships that guide further analysis.

Key Steps in EDA

  • Understanding Variable Types: Identify and understand the different types of variables in your dataset.
  • Univariate Analysis: Summarize and visualize each variable individually to understand its distribution and basic statistics using histograms, box plots, and measures of central tendency.
  • Bivariate Analysis: Explore the relationship between two variables at a time with scatter plots, correlation analyses, and similar techniques.
  • Multivariate Analysis: Investigate relationships among more than two variables using techniques such as correlation matrices, pair plots, and dimensionality reduction methods like Principal Component Analysis (PCA).
  • Data Cleaning and Transformation: Address missing values, transform variables, engineer new features, and select relevant features for further analysis.
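The first of these steps can be illustrated with base R alone. Below is a minimal sketch on a small hypothetical data frame (the object df and its columns are made up for illustration, not part of the Titanic data):

```r
# toy data frame standing in for a real dataset (hypothetical values)
df <- data.frame(
  id   = 1:4,
  name = c("a", "b", "c", "d"),
  size = c(1.5, 2.0, 2.5, 3.0),
  grp  = factor(c("x", "y", "x", "y"))
)

# report the class of every column to identify variable types
sapply(df, class)
```

A quick look at the resulting vector immediately tells you which columns are numeric, which are character, and which are already factors.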

Introducing the Titanic Dataset

For this tutorial, we will use the Titanic dataset, a classic dataset in data science and machine learning. The Titanic dataset provides information about the passengers who were aboard the RMS Titanic, which tragically sank on its maiden voyage in 1912.

This dataset is rich in both categorical and numerical variables, making it ideal for practicing Exploratory Data Analysis (EDA). It includes information such as:

  • PassengerId: A unique identifier for each passenger.
  • Survived: Whether the passenger survived (1) or did not survive (0).
  • Pclass: The passenger’s class (1st, 2nd, or 3rd).
  • Name: The name of the passenger.
  • Sex: The gender of the passenger.
  • Age: The age of the passenger.
  • SibSp: The number of siblings or spouses the passenger had aboard the Titanic.
  • Parch: The number of parents or children the passenger had aboard the Titanic.
  • Ticket: The ticket number.
  • Fare: The amount of money the passenger paid for the ticket.
  • Cabin: The cabin number (if available).
  • Embarked: The port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

The Titanic dataset is widely used for tutorials because it offers a variety of data types and challenges, such as missing values, categorical data, and potential interactions between variables. It’s a great way to practice EDA techniques and gain insights that can guide more advanced analyses or predictive modeling.

Throughout this tutorial, we will perform an Exploratory Data Analysis on the Titanic dataset to understand the characteristics of the passengers, explore survival rates, and identify key factors that might have influenced survival.

Loading and Understanding Your Data

install.packages("tidyverse")
## The following package(s) will be installed:
## - tidyverse [2.0.0]
## These packages will be installed into "~/Desktop/VŠCHT/PhD/vyuka/statisticka_analyza_dat/statistical_anaysis_in_R/Exploratory_data_analysis/renv/library/macos/R-4.4/x86_64-apple-darwin20".
## 
## # Installing packages --------------------------------------------------------
## - Installing tidyverse ...                      OK [linked from cache]
## Successfully installed 1 package in 15 milliseconds.
install.packages("plyr")
## The following package(s) will be installed:
## - plyr [1.8.9]
## These packages will be installed into "~/Desktop/VŠCHT/PhD/vyuka/statisticka_analyza_dat/statistical_anaysis_in_R/Exploratory_data_analysis/renv/library/macos/R-4.4/x86_64-apple-darwin20".
## 
## # Installing packages --------------------------------------------------------
## - Installing plyr ...                           OK [linked from cache]
## Successfully installed 1 package in 12 milliseconds.
install.packages("patchwork")
## The following package(s) will be installed:
## - patchwork [1.2.0]
## These packages will be installed into "~/Desktop/VŠCHT/PhD/vyuka/statisticka_analyza_dat/statistical_anaysis_in_R/Exploratory_data_analysis/renv/library/macos/R-4.4/x86_64-apple-darwin20".
## 
## # Installing packages --------------------------------------------------------
## - Installing patchwork ...                      OK [linked from cache]
## Successfully installed 1 package in 12 milliseconds.
install.packages("reshape2")
## The following package(s) will be installed:
## - reshape2 [1.4.4]
## These packages will be installed into "~/Desktop/VŠCHT/PhD/vyuka/statisticka_analyza_dat/statistical_anaysis_in_R/Exploratory_data_analysis/renv/library/macos/R-4.4/x86_64-apple-darwin20".
## 
## # Installing packages --------------------------------------------------------
## - Installing reshape2 ...                       OK [linked from cache]
## Successfully installed 1 package in 13 milliseconds.
install.packages("GGally")
## The following package(s) will be installed:
## - GGally [2.2.1]
## These packages will be installed into "~/Desktop/VŠCHT/PhD/vyuka/statisticka_analyza_dat/statistical_anaysis_in_R/Exploratory_data_analysis/renv/library/macos/R-4.4/x86_64-apple-darwin20".
## 
## # Installing packages --------------------------------------------------------
## - Installing GGally ...                         OK [linked from cache]
## Successfully installed 1 package in 12 milliseconds.
install.packages("factoextra")
## The following package(s) will be installed:
## - factoextra [1.0.7]
## These packages will be installed into "~/Desktop/VŠCHT/PhD/vyuka/statisticka_analyza_dat/statistical_anaysis_in_R/Exploratory_data_analysis/renv/library/macos/R-4.4/x86_64-apple-darwin20".
## 
## # Installing packages --------------------------------------------------------
## - Installing factoextra ...                     OK [linked from cache]
## Successfully installed 1 package in 14 milliseconds.
library(patchwork)
library(plyr)
library(tidyverse)
library(dplyr)
library(tidyr)
library(reshape2)
library(GGally)
library(factoextra)

Importing Data into R

library(readr)
titanic_data <- read_csv("titanic.csv")
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(titanic_data)
## # A tibble: 6 × 12
##   PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
##         <dbl>    <dbl>  <dbl> <chr>   <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1           1        0      3 Braund… male     22     1     0 A/5 2…  7.25 <NA> 
## 2           2        1      1 Cuming… fema…    38     1     0 PC 17… 71.3  C85  
## 3           3        1      3 Heikki… fema…    26     0     0 STON/…  7.92 <NA> 
## 4           4        1      1 Futrel… fema…    35     1     0 113803 53.1  C123 
## 5           5        0      3 Allen,… male     35     0     0 373450  8.05 <NA> 
## 6           6        0      3 Moran,… male     NA     0     0 330877  8.46 <NA> 
## # ℹ 1 more variable: Embarked <chr>

Understanding the Structure of the Data

The str() function provides a concise summary of the dataset, showing the number of observations (rows), the number of variables (columns) and the data type for each variable (e.g. integer, factor, character).

str(titanic_data)
## spc_tbl_ [891 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ PassengerId: num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : num [1:891] 0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : num [1:891] 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr [1:891] "male" "female" "female" "female" ...
##  $ Age        : num [1:891] 22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr [1:891] NA "C85" NA "C123" ...
##  $ Embarked   : chr [1:891] "S" "C" "S" "S" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   PassengerId = col_double(),
##   ..   Survived = col_double(),
##   ..   Pclass = col_double(),
##   ..   Name = col_character(),
##   ..   Sex = col_character(),
##   ..   Age = col_double(),
##   ..   SibSp = col_double(),
##   ..   Parch = col_double(),
##   ..   Ticket = col_character(),
##   ..   Fare = col_double(),
##   ..   Cabin = col_character(),
##   ..   Embarked = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Summary Statistics

To get a quick summary of each variable including basic descriptive statistics for numerical variables and frequency counts for categorical variables, use the summary() function.

summary(titanic_data)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 

Identifying Missing Values

colSums(is.na(titanic_data))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

Here we can see that the columns Age, Cabin and Embarked contain missing values; we will need to pay attention to this during data cleaning.

Data Cleaning and Preparation

Before diving deeper into analysis, it’s crucial to ensure that your data is clean, consistent, and ready for exploration. Data cleaning and preparation is a foundational step in any data analysis project, where we address issues like missing values, inconsistencies, and incorrect data types. This process not only improves the quality of your analysis but also helps in drawing more accurate and reliable conclusions.

In this chapter, we will walk through the essential techniques and best practices for cleaning and preparing your dataset. By the end of this section, you’ll be equipped to handle common data challenges and transform raw data into a structured format that’s ready for analysis.

Handling Missing Values

Common approaches to dealing with missing data:

  • removing rows with missing values entirely
  • imputing missing data with
    • mean/median/mode
    • neighboring values
    • predicted values
  • leaving missing data as is (tree-based methods)

What do you think are their pros & cons?
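To make the trade-offs concrete, here is a minimal base-R sketch of removal versus simple imputation on a hypothetical vector x:

```r
x <- c(4, 8, NA, 6, NA, 10)

# option 1: drop missing values entirely (loses observations)
x_complete <- x[!is.na(x)]

# option 2: impute with the mean (preserves length, shrinks variance)
x_mean_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# option 3: impute with the median (more robust to outliers)
x_median_imputed <- ifelse(is.na(x), median(x, na.rm = TRUE), x)
```

Removal is simple but discards information from the other columns of those rows; mean/median imputation keeps all rows but artificially concentrates values at a single point, as we will see with Age below.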

Missing values in the Age column

# let's first examine the distribution of values in the Age column
library(ggplot2)
ggplot(titanic_data, aes(x=Age)) + geom_histogram(bins = 30)
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).

Notice the warning caused by the 177 missing values.

qqnorm(titanic_data$Age)
qqline(titanic_data$Age, col="steelblue", lwd = 2)

We see that the distribution of Age is skewed to the right (positive skew). We will fill in the missing values with the most typical value. Given the skewness, which should we use:

  • the mean,
  • the mode or
  • the median?
# create frequency table and get mode
frequency_table <- table(titanic_data$Age)
mode_value <- as.numeric(names(frequency_table)[which.max(frequency_table)])
mode_value
## [1] 24

Unfortunately, R has no built-in function for the statistical mode (the base mode() function returns an object's storage mode instead), so we create a frequency table of all the values in the Age column and pick the most frequent one, which in this case is 24.
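The frequency-table trick above can be wrapped into a small reusable helper. This is a sketch; the name get_mode is our own, not a base R function:

```r
# statistical mode via a frequency table; ties broken by first occurrence
get_mode <- function(x) {
  x <- x[!is.na(x)]                      # ignore missing values
  freq <- table(x)
  mode_name <- names(freq)[which.max(freq)]
  # coerce back to numeric when the input was numeric
  if (is.numeric(x)) as.numeric(mode_name) else mode_name
}

get_mode(c(2, 5, 5, 7, NA, 5, 2))
## [1] 5
```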

# replace missing values in the Age column with the mode
titanic_data <- titanic_data %>% mutate(Age = replace_na(Age, mode_value))
sum(is.na(titanic_data$Age))
## [1] 0
ggplot(titanic_data, aes(x=Age)) + geom_histogram(bins = 30)

From the histogram after imputation we can see that we modified the distribution significantly. This is a compromise we have to accept, because in this case our best guess for the missing values is the mode.

Missing values in the Cabin column

687 of the 891 values in the Cabin column are missing, which tells us that this column is not very informative for predicting survival probability. For this specific case, we will remove the column from the dataset entirely.

# drop the Cabin column
titanic_data <- titanic_data %>% select(-Cabin)
head(titanic_data)
## # A tibble: 6 × 11
##   PassengerId Survived Pclass Name          Sex     Age SibSp Parch Ticket  Fare
##         <dbl>    <dbl>  <dbl> <chr>         <chr> <dbl> <dbl> <dbl> <chr>  <dbl>
## 1           1        0      3 Braund, Mr. … male     22     1     0 A/5 2…  7.25
## 2           2        1      1 Cumings, Mrs… fema…    38     1     0 PC 17… 71.3 
## 3           3        1      3 Heikkinen, M… fema…    26     0     0 STON/…  7.92
## 4           4        1      1 Futrelle, Mr… fema…    35     1     0 113803 53.1 
## 5           5        0      3 Allen, Mr. W… male     35     0     0 373450  8.05
## 6           6        0      3 Moran, Mr. J… male     24     0     0 330877  8.46
## # ℹ 1 more variable: Embarked <chr>

Missing values in the Embarked column

The Embarked column is a categorical one with three possible values: S, C or Q. The most frequent one is S (644 occurrences out of 889). We will replace the missing values with the most frequent one.

ggplot(titanic_data, aes(x = Embarked)) + geom_bar()

# replace the missing values in the Embarked column
count_table <- titanic_data %>% plyr::count("Embarked")
most_frequent <- count_table %>% filter(freq == max(freq)) %>% pull("Embarked")
titanic_data <- titanic_data %>% mutate(Embarked = replace_na(Embarked, most_frequent))
sum(is.na(titanic_data$Embarked))
## [1] 0
ggplot(titanic_data, aes(x = Embarked)) + geom_bar()

Finally let’s convert the column type to categorical.

titanic_data$Embarked <- as.factor(titanic_data$Embarked)

Data Transformation

Data transformation is a crucial step in data preparation and analysis. It involves modifying, reshaping, or aggregating your data to make it more suitable for analysis. This process allows you to derive new insights, prepare data for modeling, and ensure that the dataset is in the right format for visualization or further processing.

Column names

# check if columns are named consistently
colnames(titanic_data)
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Embarked"
# some of the column names are not self-explanatory, some are also shortened, let's unify them
titanic_data <- titanic_data %>% rename(passenger_id = PassengerId, survived = Survived, passenger_class = Pclass, name = Name, sex = Sex, age = Age, siblings_spouses = SibSp, parents_children = Parch, ticket = Ticket, fare = Fare, embarked = Embarked)
colnames(titanic_data)
##  [1] "passenger_id"     "survived"         "passenger_class"  "name"            
##  [5] "sex"              "age"              "siblings_spouses" "parents_children"
##  [9] "ticket"           "fare"             "embarked"

Changing column data types

# the intention here is to minimize the number of unique values in each column to reduce dimensionality
lapply(titanic_data %>% select(-passenger_id, -name, -ticket), unique) # exclude columns that obviously have a unique value in each row
## $survived
## [1] 0 1
## 
## $passenger_class
## [1] 3 1 2
## 
## $sex
## [1] "male"   "female"
## 
## $age
##  [1] 22.00 38.00 26.00 35.00 24.00 54.00  2.00 27.00 14.00  4.00 58.00 20.00
## [13] 39.00 55.00 31.00 34.00 15.00 28.00  8.00 19.00 40.00 66.00 42.00 21.00
## [25] 18.00  3.00  7.00 49.00 29.00 65.00 28.50  5.00 11.00 45.00 17.00 32.00
## [37] 16.00 25.00  0.83 30.00 33.00 23.00 46.00 59.00 71.00 37.00 47.00 14.50
## [49] 70.50 32.50 12.00  9.00 36.50 51.00 55.50 40.50 44.00  1.00 61.00 56.00
## [61] 50.00 36.00 45.50 20.50 62.00 41.00 52.00 63.00 23.50  0.92 43.00 60.00
## [73] 10.00 64.00 13.00 48.00  0.75 53.00 57.00 80.00 70.00 24.50  6.00  0.67
## [85] 30.50  0.42 34.50 74.00
## 
## $siblings_spouses
## [1] 1 0 3 4 2 5 8
## 
## $parents_children
## [1] 0 1 2 5 3 4 6
## 
## $fare
##   [1]   7.2500  71.2833   7.9250  53.1000   8.0500   8.4583  51.8625  21.0750
##   [9]  11.1333  30.0708  16.7000  26.5500  31.2750   7.8542  16.0000  29.1250
##  [17]  13.0000  18.0000   7.2250  26.0000   8.0292  35.5000  31.3875 263.0000
##  [25]   7.8792   7.8958  27.7208 146.5208   7.7500  10.5000  82.1708  52.0000
##  [33]   7.2292  11.2417   9.4750  21.0000  41.5792  15.5000  21.6792  17.8000
##  [41]  39.6875   7.8000  76.7292  61.9792  27.7500  46.9000  80.0000  83.4750
##  [49]  27.9000  15.2458   8.1583   8.6625  73.5000  14.4542  56.4958   7.6500
##  [57]  29.0000  12.4750   9.0000   9.5000   7.7875  47.1000  15.8500  34.3750
##  [65]  61.1750  20.5750  34.6542  63.3583  23.0000  77.2875   8.6542   7.7750
##  [73]  24.1500   9.8250  14.4583 247.5208   7.1417  22.3583   6.9750   7.0500
##  [81]  14.5000  15.0458  26.2833   9.2167  79.2000   6.7500  11.5000  36.7500
##  [89]   7.7958  12.5250  66.6000   7.3125  61.3792   7.7333  69.5500  16.1000
##  [97]  15.7500  20.5250  55.0000  25.9250  33.5000  30.6958  25.4667  28.7125
## [105]   0.0000  15.0500  39.0000  22.0250  50.0000   8.4042   6.4958  10.4625
## [113]  18.7875  31.0000 113.2750  27.0000  76.2917  90.0000   9.3500  13.5000
## [121]   7.5500  26.2500  12.2750   7.1250  52.5542  20.2125  86.5000 512.3292
## [129]  79.6500 153.4625 135.6333  19.5000  29.7000  77.9583  20.2500  78.8500
## [137]  91.0792  12.8750   8.8500 151.5500  30.5000  23.2500  12.3500 110.8833
## [145] 108.9000  24.0000  56.9292  83.1583 262.3750  14.0000 164.8667 134.5000
## [153]   6.2375  57.9792  28.5000 133.6500  15.9000   9.2250  35.0000  75.2500
## [161]  69.3000  55.4417 211.5000   4.0125 227.5250  15.7417   7.7292  12.0000
## [169] 120.0000  12.6500  18.7500   6.8583  32.5000   7.8750  14.4000  55.9000
## [177]   8.1125  81.8583  19.2583  19.9667  89.1042  38.5000   7.7250  13.7917
## [185]   9.8375   7.0458   7.5208  12.2875   9.5875  49.5042  78.2667  15.1000
## [193]   7.6292  22.5250  26.2875  59.4000   7.4958  34.0208  93.5000 221.7792
## [201] 106.4250  49.5000  71.0000  13.8625   7.8292  39.6000  17.4000  51.4792
## [209]  26.3875  30.0000  40.1250   8.7125  15.0000  33.0000  42.4000  15.5500
## [217]  65.0000  32.3208   7.0542   8.4333  25.5875   9.8417   8.1375  10.1708
## [225] 211.3375  57.0000  13.4167   7.7417   9.4833   7.7375   8.3625  23.4500
## [233]  25.9292   8.6833   8.5167   7.8875  37.0042   6.4500   6.9500   8.3000
## [241]   6.4375  39.4000  14.1083  13.8583  50.4958   5.0000   9.8458  10.5167
## 
## $embarked
## [1] S C Q
## Levels: C Q S
# even though the age column is stored as a float, most of its values are integers, so let's round the values to the nearest integer
titanic_data <- titanic_data %>% mutate(age = round(age))
unique(titanic_data$age)
##  [1] 22 38 26 35 24 54  2 27 14  4 58 20 39 55 31 34 15 28  8 19 40 66 42 21 18
## [26]  3  7 49 29 65  5 11 45 17 32 16 25  1 30 33 23 46 59 71 37 47 70 12  9 36
## [51] 51 56 44 61 50 62 41 52 63 43 60 10 64 13 48 53 57 80  6  0 74

We reduced the number of unique values in the age column from 88 to 71. Another float column with a lot of unique values is the fare column. Let’s try to see if the correlation with the survived column changes if we round these values.

titanic_data <- titanic_data %>% mutate(fare_rounded = round(fare))
cor(titanic_data$fare, titanic_data$survived)
cor(titanic_data$fare_rounded, titanic_data$survived)
titanic_data <- titanic_data %>% select(-fare) %>% rename(fare = fare_rounded)
length(unique(titanic_data$fare))

The correlation coefficient doesn’t get affected much and we reduced the number of unique values from 248 to 90!

The next step is to convert the sex column to categorical.

titanic_data$sex <- as.factor(titanic_data$sex)

Creating new columns

Here we will focus on creating new columns. Is there a column that combines multiple pieces of information that could be split into several columns? Let’s look at the name column. What can we extract from it?

Extracting title from name

head(titanic_data$name)
## [1] "Braund, Mr. Owen Harris"                            
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"                             
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
## [5] "Allen, Mr. William Henry"                           
## [6] "Moran, Mr. James"

It seems that the name column follows the format: last_name, title first_name(s) “nickname” (more_first_names?).

# first we split the last name from the rest (title and first names), separated by ", "
titanic_data <- titanic_data %>% mutate(title_and_first_names = (str_split(name, ", ", simplify = TRUE))[,2])
titanic_data <- titanic_data %>% mutate(title = str_split(title_and_first_names, " ", simplify = TRUE)[, 1])
titanic_data <- titanic_data %>% select(-title_and_first_names)
head(titanic_data$title)
## [1] "Mr."   "Mrs."  "Miss." "Mrs."  "Mr."   "Mr."
ggplot(titanic_data, aes(x = title)) + geom_bar()

Now that we successfully extracted the title, let’s look at the values. We see that there are some dominant titles like Mr., Mrs., Miss., Master. and maybe even Dr. and Rev. The remaining titles are rare. To reduce the number of unique values, let’s classify these rare titles as Other.

titanic_data <- titanic_data %>% mutate(
    title = if_else(title %in% c("Mr.", "Mrs.", "Miss.", "Master.", "Dr.", "Rev."), 
                    title, 
                    "Other")
  )
ggplot(titanic_data, aes(x = title)) + geom_bar()

# finally let's convert the title column to categorical
titanic_data$title <- as.factor(titanic_data$title)
str(titanic_data)
## tibble [891 × 12] (S3: tbl_df/tbl/data.frame)
##  $ passenger_id    : num [1:891] 1 2 3 4 5 6 7 8 9 10 ...
##  $ survived        : num [1:891] 0 1 1 1 0 0 0 0 1 1 ...
##  $ passenger_class : num [1:891] 3 1 3 1 3 3 1 3 3 2 ...
##  $ name            : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ sex             : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ age             : num [1:891] 22 38 26 35 35 24 54 2 27 14 ...
##  $ siblings_spouses: num [1:891] 1 1 0 1 0 0 0 3 0 1 ...
##  $ parents_children: num [1:891] 0 0 0 0 0 0 0 1 2 0 ...
##  $ ticket          : chr [1:891] "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ fare            : num [1:891] 7.25 71.28 7.92 53.1 8.05 ...
##  $ embarked        : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
##  $ title           : Factor w/ 7 levels "Dr.","Master.",..: 4 5 3 5 4 4 4 2 5 5 ...

Dealing with Outliers

# let's identify outliers of the numerical columns

p1 <- ggplot(titanic_data, aes(x = age)) + geom_boxplot() + coord_flip()
p2 <- ggplot(titanic_data, aes(x = siblings_spouses)) + geom_boxplot() + coord_flip()
p3 <- ggplot(titanic_data, aes(x = parents_children)) + geom_boxplot() + coord_flip()
p4 <- ggplot(titanic_data, aes(x = fare)) + geom_boxplot() + coord_flip()

combined_plot <- p1 + p2 + p3 + p4 + plot_layout(ncol = 4)

combined_plot

Discussion: how would you treat these outliers?

Normalizing/Standardizing Data

Introduction to Normalization and Standardization

  • Normalization: Scaling features to a fixed range, typically [0, 1]. Useful when you need features to be on a similar scale, especially for algorithms that are sensitive to the magnitude of features, like neural networks. Features with larger scales can dominate distance metrics and gradient-based methods.
  • Standardization: Transforming features to have a mean of 0 and a standard deviation of 1. Useful when data needs to have the properties of a standard normal distribution, which is important for algorithms that assume normally distributed data, like linear regression.

Some Examples of Methods for Normalizing/Standardizing Data

  • Min-Max Normalization (x - min(x)) / (max(x) - min(x))
  • Z-Score Standardization (x - mean(x)) / sd(x)
  • Robust Scaling (x - median(x)) / IQR(x)
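Each of these formulas translates into a one-line R function. As a sketch of the one method not applied to the Titanic columns in this tutorial, here is z-score standardization:

```r
# z-score standardization: center to mean 0, scale to sd 1
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

z <- z_score(c(10, 20, 30, 40, 50))
round(mean(z), 10)  # mean becomes 0
round(sd(z), 10)    # standard deviation becomes 1
```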

We have 4 numerical columns: siblings_spouses, parents_children, age and fare. Even though the first two are quantitative, they have specific interpretations in their raw form, so we won’t normalize/standardize them. We will, however, normalize the age and fare columns.

# min-max normalization of the age column
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

titanic_data <- titanic_data %>%
  mutate(
    age = normalize(age)
  )

ggplot(titanic_data, aes(x = age)) + geom_histogram(bins = 30)

# robust scaling of the fare column
robust_scale <- function(x) {
  (x - median(x, na.rm = TRUE)) / IQR(x, na.rm = TRUE)
}

titanic_data <- titanic_data %>%
  mutate(
    fare = robust_scale(fare)
  )

ggplot(titanic_data, aes(x = fare)) + geom_histogram(bins = 30)

Bivariate Analysis

Bivariate Analysis examines the relationship between two variables to understand their interactions and correlations. Techniques used include scatter plots, which visually depict the relationship between two variables, and correlation coefficients, which quantify the strength and direction of their linear relationship. Bivariate analysis helps in identifying patterns, associations, and potential predictive relationships between variables.

Correlation Analysis

Correlation Analysis evaluates the strength and direction of the linear relationship between two variables. It is quantified using the correlation coefficient, which ranges from -1 to 1. A coefficient close to 1 indicates a strong positive relationship, close to -1 indicates a strong negative relationship, and around 0 suggests no linear relationship. Correlation analysis helps in understanding how changes in one variable might be associated with changes in another.

correlation_matrix <- round(cor(titanic_data %>% select(survived, passenger_class, age, siblings_spouses, parents_children, fare)), 2) # considering only numerical columns
melted_correlation_matrix <- melt(correlation_matrix)
ggplot(data = melted_correlation_matrix, aes(x = Var1, y = Var2, fill = value)) + geom_tile()

Scatter Plots

Scatter Plot is a graphical representation that displays the relationship between two continuous variables. Each point on the plot represents an observation, with its position determined by the values of the two variables. Scatter plots help visualize trends, correlations, and the distribution of data, making it easier to identify patterns, clusters, or outliers.

p1 <- ggplot(titanic_data, aes(x = passenger_class, y = survived)) +
  geom_point(alpha = 0.5) +
  labs(title = "Passenger class vs Survived")

p2 <- ggplot(titanic_data, aes(x = age, y = survived)) +
  geom_point(alpha = 0.5) +
  labs(title = "Age vs Survived")

p3 <- ggplot(titanic_data, aes(x = siblings_spouses, y = survived)) +
  geom_point(alpha = 0.5) +
  labs(title = "Number of siblings and spouses vs Survived")

p4 <- ggplot(titanic_data, aes(x = parents_children, y = survived)) +
  geom_point(alpha = 0.5) +
  labs(title = "Number of parents and children vs Survived")

p5 <- ggplot(titanic_data, aes(x = fare, y = survived)) +
  geom_point(alpha = 0.5) +
  labs(title = "Fare vs Survived")

# combine individual plots with patchwork
(p1 | p2 | p3) / (p4 | p5)

Pair Plots

Pair Plot (or scatterplot matrix) is a grid of scatter plots that shows the relationships between all pairs of variables in a dataset. Each cell in the matrix is a scatter plot of two variables, while the diagonal typically shows the distribution of each variable. Pair plots provide a comprehensive view of the interactions between variables, helping to identify correlations and patterns across multiple dimensions.

# select relevant columns
numeric_columns <- titanic_data %>% dplyr::select(age, fare, siblings_spouses, parents_children, survived)

# create a pair plot
ggpairs(numeric_columns) +
  labs(title = "Pair Plot of Selected Titanic Dataset Variables")

Multivariate Analysis

Multivariate analysis enables you to uncover more complex patterns in your data by exploring relationships between multiple variables simultaneously.

Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms data into a set of orthogonal components (principal components) that capture the most variance in the data. By projecting the data onto these components, PCA reduces the number of variables while preserving as much of the original variability as possible. This is useful for simplifying datasets, revealing underlying structures, and visualizing high-dimensional data in lower dimensions.

# select numeric columns for PCA
numeric_columns <- titanic_data %>%
  dplyr::select(age, fare, siblings_spouses, parents_children, survived) %>%
  na.omit()  # remove rows with missing values

# perform PCA
pca_result <- prcomp(numeric_columns, scale. = TRUE)
pca_scores <- as.data.frame(pca_result$x)

# add the survived column to the pca scores
pca_scores$survived <- numeric_columns$survived
ggplot(pca_scores, aes(x = PC1, y = PC2)) +
  geom_point(alpha = 0.7, aes(color = survived)) +
  labs(title = "PC1 vs PC2", x = "PC1", y = "PC2") +
  theme_minimal()

ggplot(pca_scores, aes(x = PC2, y = PC3)) +
  geom_point(alpha = 0.7, aes(color = survived)) +
  labs(title = "PC2 vs PC3", x = "PC2", y = "PC3") +
  theme_minimal()

ggplot(pca_scores, aes(x = PC1, y = PC3)) +
  geom_point(alpha = 0.7, aes(color = survived)) +
  labs(title = "PC1 vs PC3", x = "PC1", y = "PC3") +
  theme_minimal()
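Before reading too much into any single scatter of scores, it helps to check how much variance each component actually explains. Here is a self-contained sketch on the built-in USArrests data; the same computation applies to the pca_result object above:

```r
# PCA on a built-in dataset, scaled as in the Titanic example
pca <- prcomp(USArrests, scale. = TRUE)

# proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# cumulative variance: how many components are needed to cover, say, 90%?
cumsum(var_explained)
```

If the first two or three components capture most of the variance, the 2-D score plots above are a faithful summary; otherwise, patterns may be hiding in the later components.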

Clustering Analysis

Clustering Analysis involves grouping data points into clusters such that points within the same cluster are more similar to each other than to those in other clusters. The goal is to identify natural groupings in the data. Common methods include k-means clustering, which partitions data into a predefined number of clusters by minimizing the distance between points and their cluster centroids, and hierarchical clustering, which builds a hierarchy of clusters through either iterative merging (agglomerative) or splitting (divisive). Clustering helps in discovering patterns and segmenting data for further analysis.

# standardize or normalize the numerical features to ensure they are on a similar scale
scaled_data <- scale(numeric_columns)

K-means clustering

K-means clustering partitions a dataset into k clusters by iteratively assigning data points to the nearest centroid and updating the centroids based on the mean of assigned points. The process continues until the centroids stabilize or a set number of iterations is reached. It is efficient for large datasets but requires specifying k and can be sensitive to initial centroid placement and outliers.

# use methods like Elbow Method or Silhouette Analysis to decide the optimal number of clusters
fviz_nbclust(scaled_data, kmeans, method = "wss")

set.seed(123)  # for reproducibility

# perform k-means clustering
kmeans_result <- kmeans(scaled_data, centers = 3, nstart = 25)

# add cluster assignments to the original data
titanic_data$cluster <- kmeans_result$cluster

# visualize clusters
ggplot(titanic_data, aes(x = age, y = fare, color = as.factor(cluster))) +
  geom_point(alpha = 0.6) +
  labs(title = "K-means Clustering", color = "Cluster") +
  theme_minimal()

Hierarchical clustering

Hierarchical clustering creates a tree-like structure of clusters called a dendrogram. Agglomerative hierarchical clustering starts with each point as its own cluster and merges the closest clusters iteratively. Divisive hierarchical clustering starts with one large cluster and splits it iteratively. It doesn’t require specifying the number of clusters beforehand but can be computationally intensive for large datasets and does not allow for reassigning data points once clusters are formed.

# compute distance matrix
dist_matrix <- dist(scaled_data)

# perform hierarchical clustering
hclust_result <- hclust(dist_matrix, method = "ward.D2")

# cut tree to get clusters
clusters_hc <- cutree(hclust_result, k = 3)

# add cluster assignments to the original data
titanic_data$cluster_hc <- clusters_hc

# plot dendrogram
plot(hclust_result, main = "Hierarchical Clustering Dendrogram", labels = FALSE)

ggplot(titanic_data, aes(x = age, y = fare, color = as.factor(cluster_hc))) +
  geom_point(alpha = 0.6) +
  labs(title = "Hierarchical Clustering", color = "Cluster") +
  theme_minimal()

Dealing with High-Dimensional Data

PCA

We have already covered principal component analysis above.

ggplot(pca_scores, aes(x = PC1, y = PC2)) +
  geom_point(alpha = 0.7, aes(color = survived)) +
  labs(title = "PC1 vs PC2", x = "PC1", y = "PC2") +
  theme_minimal()

Other options for dimensionality reduction are t-SNE, UMAP or MDS.
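As a taste of one of these alternatives, classical MDS ships with base R as cmdscale. Below is a minimal sketch on the built-in eurodist distance matrix; the dist_matrix from the clustering section could be used instead:

```r
# classical multidimensional scaling: embed a distance matrix in 2 dimensions
mds_coords <- cmdscale(eurodist, k = 2)

dim(mds_coords)   # one 2-D point per city
head(mds_coords)
```

Plotting the two resulting coordinates recovers an approximate map of the cities, which is exactly the kind of low-dimensional view MDS provides for high-dimensional data.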